Identification of Adopted Pali Words in Myanmar Text
Abstract
The Myanmar language has been significantly influenced by Pali through the practice of Buddhism and the study of Buddhist literature in Myanmar. As a result, Pali words have been widely adopted into Myanmar. This study presents an algorithm for identifying adopted Pali words in Myanmar text. The system combines rule-based syllable segmentation with a dictionary-based longest-matching method. A program was developed and trained on a corpus of 8,895 sentences, in which it recognized 579 unique Pali words. The system was then tested on a separate corpus of 3,641 sentences, where it correctly identified 279 unique Pali words, achieving a precision of 97.59%, a recall of 99.04%, and an F-measure of 98.31%. Because Pali words are unavoidable in Myanmar text, these results can benefit many Myanmar NLP tasks, such as spell checking, text categorization, and text-to-speech synthesis.
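The dictionary-based longest-matching step named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the text has already been split into syllables by some segmenter, and the function names `find_dictionary_words` and `f_measure`, the romanized toy syllables, and the toy dictionary are all hypothetical. The second function only checks that the reported F-measure follows from the reported precision and recall.

```python
from typing import Iterable, List, Sequence, Set, Tuple


def find_dictionary_words(
    syllables: Sequence[str],
    dictionary: Iterable[Sequence[str]],
) -> List[Tuple[int, int]]:
    """Return (start, end) syllable spans of dictionary entries.

    At each position the longest dictionary entry that matches is taken
    (greedy longest matching); unmatched syllables are skipped.
    """
    entries: Set[Tuple[str, ...]] = {tuple(entry) for entry in dictionary}
    max_len = max((len(entry) for entry in entries), default=0)
    spans: List[Tuple[int, int]] = []
    i = 0
    while i < len(syllables):
        matched = False
        # Try the longest candidate first, then shorter ones.
        for n in range(min(max_len, len(syllables) - i), 0, -1):
            if tuple(syllables[i:i + n]) in entries:
                spans.append((i, i + n))
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return spans


def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (balanced F-measure)."""
    return 2 * precision * recall / (precision + recall)


# Hypothetical romanized syllables and dictionary, for illustration only.
toy_dictionary = [
    ("bu", "da"),
    ("bu", "da", "ba", "tha"),
    ("tha", "tha", "na"),
]
toy_syllables = ["myan", "ma", "bu", "da", "ba", "tha", "tha", "tha", "na"]

print(find_dictionary_words(toy_syllables, toy_dictionary))  # [(2, 6), (6, 9)]
print(round(f_measure(97.59, 99.04), 2))  # 98.31
```

In the toy run, the four-syllable entry is chosen over the two-syllable entry that starts at the same position, which is the behavior that distinguishes longest matching from first-match lookup.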
Similar resources
A Rule-based Syllable Segmentation of Myanmar Text
Myanmar script uses no spaces between words, and syllable segmentation is an important step in many NLP tasks such as word segmentation, sorting, and line breaking. In this study, a rule-based syllable segmentation algorithm for Myanmar text is proposed. Segmentation rules were created based on the syllable structure of Myanmar script and a syllable segmentation algor...
Author gender identification from text using Bayesian Random Forest
The heavy use of virtual environments and the connections users form through social networks such as Facebook, Instagram, and Twitter make it more necessary than ever to identify shared topics in these environments. Several applications benefit from reliable methods for inferring the age and gender of social media users. Such applications exist across a wide range of fields,...
Unsupervised Morphemes Segmentation
In this work, we describe the algorithm adopted to split words into the smallest possible meaningful units, or morphemes. The algorithm is unsupervised and does not depend on any particular language. The model was developed using English; however, no linguistic rules specific to English are implemented. The algorithm focuses on the identification of the smallest units of words based on thei...
Myanmar Word Segmentation using Syllable level Longest Matching
In Myanmar language, sentences are clearly delimited by a unique sentence boundary marker but are written without necessarily pausing between words with spaces. It is therefore non-trivial to segment sentences into words. Word tokenizing plays a vital role in most Natural Language Processing applications. We observe that word boundaries generally align with syllable boundaries. Working directly...
Myanmar Number Normalization for Text-to-Speech
Text normalization is an essential module of a Text-to-Speech (TTS) system, as TTS systems need to work on real text. This paper describes Myanmar number normalization designed for a Myanmar Text-to-Speech system. Semiotic classes for the Myanmar language are identified through the study of a Myanmar text corpus, and Weighted Finite State Transducers (WFST) based Myanmar number normalization is implemented. N...